ABSTRACT

The  results  of  observational  studies  are  often  disputed  because  of 
nonrandom  treatment  assignment.  For  example,  patients  at  greater  risk  may  be 
overrepresented  in  some  treatment  groups.  This  paper  discusses  the  central 
role of "propensity scores" and "balancing scores" in the analysis of
observational  studies.  The  propensity  score  is  the  (estimated)  conditional 
probability  of  assignment  to  a  particular  treatment  given  a  vector  of  observed 
covariates.  Both  large  and  small  sample  theory  show  that  adjustment  for  the 
scalar  propensity  score  is  sufficient  to  remove  bias  due  to  all  observed 
covariates.  Applications  include:  (1)  matched  sampling  on  the  univariate 
propensity  score  which  is  equal  percent  bias  reducing  under  more  general 
conditions  than  required  for  discriminant  matching,  (2)  multivariate 
adjustment  by  subclassification  on  balancing  scores  where  the  same  subclasses 
are used to estimate treatment effects for all outcome variables and in all
subpopulations, and (3) visual representation of multivariate adjustment by a
two-dimensional  plot. 

THE CENTRAL ROLE OF THE PROPENSITY SCORE
IN  OBSERVATIONAL  STUDIES  FOR  CAUSAL  EFFECTS 

Paul R. Rosenbaum and Donald B. Rubin

1. DEFINITIONS

An  experiment  is  defined  as  a  comparison  of  several  treatments,  any  one 
of  which  may  be  given  to  or  withheld  from  any  of  N  units  (e.g.,  medical 
patients)  under  study.  Inferences  about  the  effects  of  treatments  involve 
speculations  about  the  effect  one  treatment  would  have  had  on  a  unit  which,  in 
fact, received some other treatment. In a series of papers, Rubin (e.g.,
1978) formalized this concept in a way consistent with that traditionally used
in the literature of experimental design (e.g., Fisher (1953) and Kempthorne
(1952)). Suppose there are only two treatments, 1 and 0. In principle, the
ith of the N units under study has both a response $r_{1i}$ that would have
resulted if it had received treatment 1, and a response $r_{0i}$ that would have
resulted if it had received treatment 0. In this formulation, causal effects
are comparisons of $r_{1i}$ and $r_{0i}$ (e.g., $r_{1i} - r_{0i}$ or
$r_{1i}/r_{0i}$). Since each unit receives only one treatment, either $r_{1i}$
or $r_{0i}$ is observed, but not both, so comparisons of $r_{1i}$ and $r_{0i}$
imply some degree of speculation. In a sense, estimating the causal effects of
treatments is a missing data problem, since either $r_{1i}$ or $r_{0i}$ is
missing.

The  above  formulation  contains  some  implicit  assumptions.  For  example, 
the response $r_{ti}$ of unit i to treatment t might depend on the treatment
given to unit j if, for example, they compete for resources. Such a
situation complicates not only the analysis, but the definition of the causal
effect  as  well.  Such  problems  are  not  considered  in  this  paper.  For  a  fuller 
discussion,  see  Cox  (1958,  chapter  2)  or  Rubin  (1978,  section  2.3). 

In  this  paper,  the  N  units  in  the  study  are  viewed  as  a  simple  random 
sample  from  some  population,  and  the  average  treatment  effect  is  defined  as 

$$E(r_1) - E(r_0) \qquad (1.1)$$

where $E(\cdot)$ denotes expectation in the population. In large randomized
experiments,  the  results  in  the  two  treatment  groups  may  often  be  directly 
compared  because  the  units  in  the  two  treatment  groups  are  likely  to  be 
similar,  whereas  in  nonrandomized  experiments,  such  direct  comparisons  may  be 
misleading  because  units  exposed  to  one  treatment  generally  differ 
systematically from the units exposed to the other treatment. Cox (1981,
p. 291) has observed that there is a need for further discussion of observational
studies  with  particular  emphasis  on  bias  isolation  and  removal. 
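
The gap between a direct comparison of treatment groups and the average treatment effect (1.1) can be illustrated with a small exact calculation. The sketch below is not from the paper: the two-point covariate, the assignment probabilities, and the response values are all hypothetical, chosen so that higher-risk units (x = 1) are treated more often.

```python
from fractions import Fraction as F

# Hypothetical population: x = 1 marks higher-risk patients.
p_x = {0: F(1, 2), 1: F(1, 2)}   # distribution of the covariate x
e = {0: F(1, 5), 1: F(4, 5)}     # p(z = 1 | x): higher-risk units are treated more often
r1 = {0: F(1), 1: F(3)}          # response under treatment 1, as a function of x
r0 = {0: F(0), 1: F(2)}          # response under treatment 0, as a function of x

def average_treatment_effect():
    """E(r1) - E(r0), expression (1.1)."""
    return sum(p_x[x] * (r1[x] - r0[x]) for x in p_x)

def naive_difference():
    """E(r1 | z = 1) - E(r0 | z = 0), the direct comparison of the two groups."""
    p_z1 = sum(p_x[x] * e[x] for x in p_x)
    p_z0 = 1 - p_z1
    mean_treated = sum(p_x[x] * e[x] * r1[x] for x in p_x) / p_z1
    mean_control = sum(p_x[x] * (1 - e[x]) * r0[x] for x in p_x) / p_z0
    return mean_treated - mean_control

print(average_treatment_effect())  # 1
print(naive_difference())          # 11/5
```

In this constructed population the treatment effect is 1 for every unit, yet the direct comparison gives 11/5, because the treated group contains a larger share of high-risk units.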

For the ith patient of N patients in the study (i = 1, ..., N), let $z_i$
be the indicator for treatment assignment, with $z_i = 1$ if unit i is
assigned to the experimental treatment, and $z_i = 0$ if unit i is assigned
to the control treatment. Let $x_i$ be a vector of observed pretreatment
measurements or covariates for the ith unit; all of the measurements in x
were made prior to treatment assignment, but x may not include all
covariates used to make treatment assignments.

Suppose  each  unit  can  be  assigned  a  scalar  "balancing"  score  b(x)  such 
that,  at  each  value  of  the  balancing  score,  the  distribution  of  the  observed 
covariates  x  is  the  same  for  the  treated  and  control  units;  that  is,  suppose 
b(x)  exists  such  that,  in  Dawid's  (1978)  notation, 

$$z \perp\!\!\!\perp x \mid b(x).$$

Then, at each value of the balancing score, the difference between treatment
and  control  means  on  the  response  r  is  unconfounded  with  x,  although  it 
may  be  confounded  with  unobserved  covariates. 

We  prove  that  such  a  balancing  score  always  exists,  and  then  show  that 
easily  obtained  estimates  of  the  balancing  score  behave  like  balancing  scores; 
indeed,  in  sections  3.3  and  4.2  we  find  that  an  estimated  balancing  score  can 
produce greater sample balance than the population balancing score. Moreover, in
section  4  we  see  that  common  methods  of  adjustment  in  observational  studies  — 
including  covariance  adjustment,  and  discriminant  matching  (Cochran  and  Rubin, 
1973)  —  implicitly  adjust  for  an  estimated  balancing  score. 

In  order  to  motivate  formally  adjustment  for  a  balancing  score,  we  must 
consider  the  sampling  distribution  of  treatment  assignments.  Let  the 
conditional  probability  of  assignment  to  treatment  one,  given  the  covariates, 
be  denoted  by 

$$e(x) = p(z = 1 \mid x) \qquad (1.2)$$

where we assume

$$p(z_1, \ldots, z_n \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} e(x_i)^{z_i} \{1 - e(x_i)\}^{1 - z_i}.$$

Although  this  strict  independence  assumption  is  not  essential,  it  simplifies 
notation  and  discussion.  The  function  e(x)  is  called  the  propensity  score, 
that  is,  the  propensity  towards  exposure  to  treatment  one  given  the  observed 
covariates  x. 

Randomized  and  nonrandomized  trials  differ  in  two  distinct  ways.  First, 
in a randomized trial, $z_i$ has a distribution determined by a known random
mechanism;  therefore,  in  particular,  the  propensity  score  is  a  known  function: 
there  exists  one  accepted  specification  for  e(x).  In  a  nonrandomized 
experiment,  the  propensity  score  function  is  almost  always  unknown:  there  is 
not  one  accepted  specification  for  e(x);  however,  e(x)  may  be  estimated  from 
observed data, perhaps using a model such as a logit model. To a Bayesian,
estimates  of  these  probabilities  are  posterior  predictive  probabilities  of 
assignment  to  treatment  1  for  a  unit  with  vector  x  of  covariates. 
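
As one illustration of estimating e(x) from observed data with a logit model, the following sketch fits a one-covariate model logit e(x) = a + b x by Newton-Raphson on the Bernoulli likelihood. It is only a minimal sketch: the function name `fit_logit`, the fixed iteration count, and the eight data points are assumptions made for illustration, not part of the paper.

```python
import math

def fit_logit(x, z, iterations=25):
    """Fit logit e(x) = a + b*x by Newton-Raphson on the Bernoulli log likelihood."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Gradient of the log likelihood with respect to (a, b).
        g0 = sum(zi - pi for zi, pi in zip(z, p))
        g1 = sum((zi - pi) * xi for zi, pi, xi in zip(z, p, x))
        # Negative Hessian (information matrix) entries.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^{-1} times the gradient.
        a += (h11 * g0 - h01 * g1) / det
        b += (-h01 * g0 + h00 * g1) / det
    return a, b

# Hypothetical data: one covariate, treatment more likely at larger x.
x = [0, 1, 2, 3, 4, 5, 6, 7]
z = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_logit(x, z)
e_hat = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
```

At the maximum likelihood estimate the fitted probabilities sum to the number of treated units, which gives a simple check on convergence. The sketch assumes the data are not perfectly separable; otherwise the estimates diverge.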

The  second  way  randomized  trials  differ  from  nonrandomized  trials  is 
that,  in  a  randomized  trial,  x  is  known  to  contain  all  covariates  that  are 
both used to assign treatments and possibly related to the response
$(r_{1i}, r_{0i})$. More formally, in a randomized trial, treatment assignment
$z_i$ and response $(r_{1i}, r_{0i})$ are known to be conditionally independent
given $x_i$,

$$(r_{1i}, r_{0i}) \perp\!\!\!\perp z_i \mid x_i. \qquad (1.3)$$

Condition  (1.3)  is  usually  not  known  to  hold  in  a  nonrandomized  experiment. 
Generally,  we  shall  say  treatment  assignment  is  strongly  ignorable  given  a 
vector  of  covariates  v  if 

$$(r_{1i}, r_{0i}) \perp\!\!\!\perp z_i \mid v_i.$$

For  brevity,  when  treatment  assignment  is  strongly  ignorable  given  the 
observed  covariates  x  (i.e.,  when  (1.3)  holds),  we  shall  say  simply  that 
treatment  assignment  is  strongly  ignorable.  (Note  that  if  treatment 
assignment  is  strongly  ignorable,  then  it  is  ignorable  in  Rubin's  (1978) 
sense,  which  only  requires  that  the  probabilities  be  evaluated  at  observed 
outcomes;  however,  the  converse  is  not  true  since  strongly  ignorable  implies 
the relationship among probabilities must hold for all possible values of the
random variables.)

2. LARGE SAMPLE THEORY

This  section  presents  four  theorems  whose  conclusions  may  be  summarized 
as  follows. 

(1) The propensity score is a balancing score.

(2) Any score which is "finer" than a propensity score is a balancing score.

(3) If treatment assignment is strongly ignorable given x, then it is
strongly ignorable given any balancing score.

(4) At any value of a balancing score, the difference between the treatment
and control means is an unbiased estimate of the average treatment effect
at that value of the balancing score if treatment assignment is strongly
ignorable.

The results of this section treat e(x) as known, and are therefore
applicable to large samples. The effects of estimating e(x) in small
samples are considered in section 3.

Theorem  1 

Treatment assignment and the observed covariates are conditionally
independent given the propensity score, that is

$$z \perp\!\!\!\perp x \mid e(x).$$

The above theorem is a special case of Theorem 2, and so no separate
proof is given. Cochran and Rubin (1973) proved the result of Theorem 1
in the special case of multivariate normal covariates x; the result holds
regardless of the distribution of x.

Theorem  2 

Let $b^*(x)$ be a (possibly vector valued) function of x which is finer
than e(x) in the sense that $e(x) = f(b^*(x))$ for some function $f(\cdot)$.
Then

$$z \perp\!\!\!\perp x \mid b^*(x). \qquad (2.1)$$

In particular, if $b^*(x)$ is scalar valued, then (2.1) asserts that $b^*(x)$ is
a balancing score.

Proof: It is sufficient to show

$$p(z=1 \mid x) = p(z=1 \mid b^*(x)).$$

Recall that, by definition, $e(x) = p(z=1 \mid x)$. Because $e(x) = f(b^*(x))$,
e(x) is constant on the set $\{x : b^*(x) = c\}$ and can be taken outside the
integral below. Now

$$\begin{aligned}
p(z=1 \mid b^*(x) = c)
  &= \int_{x:\, b^*(x)=c} p(z=1 \mid x)\, p(x \mid b^*(x) = c)\, dx \\
  &= e(x) \int_{x:\, b^*(x)=c} p(x \mid b^*(x) = c)\, dx \\
  &= e(x) \\
  &= p(z=1 \mid x)
\end{aligned}$$

as required. //

Theorem  1  implies  that  if  a  subclass  of  units  or  a  matched  treatment- 
control  pair  is  homogeneous  in  e(x),  then  the  treated  and  control  units  in 
that  subclass  or  matched  pair  will  have  the  same  distribution  of  x.  Theorem 
2  implies  that  if  subclasses  or  matched  treatment-control  pairs  are 
homogeneous  in  both  e(x)  and  certain  chosen  components  of  x,  it  is  still 
reasonable  to  expect  balance  on  the  other  components  of  x  within  these 
refined  subclasses  or  matched  pairs.  The  practical  importance  of  Theorem  2 
beyond  Theorem  1  arises  because  it  is  sometimes  advantageous  to  subclassify  or 
match  not  only  for  e(x),  but  for  other  components  of  x  as  well;  in 
particular,  such  a  refined  procedure  may  be  used  to  obtain  estimates  of  the 
average treatment effect in subpopulations defined by components of x
(e.g., males, females).
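
The balancing property asserted by Theorem 1 can be checked exactly in a small discrete population. In the sketch below the covariate distribution and the propensity score are hypothetical; e(x) is chosen to depend on x = (x1, x2) only through x1 + x2, so that two distinct covariate values share one value of the score.

```python
from fractions import Fraction as F

# Hypothetical population with two binary covariates x = (x1, x2).
p_x = {(0, 0): F(4, 10), (0, 1): F(3, 10), (1, 0): F(1, 10), (1, 1): F(2, 10)}
# Hypothetical propensity score taking the values 1/6, 1/2, 5/6.
e = {x: F(1 + 2 * (x[0] + x[1]), 6) for x in p_x}

def dist_x_given(z, score=None):
    """p(x | z) if score is None, otherwise p(x | e(x) = score, z)."""
    weights = {}
    for x, px in p_x.items():
        if score is not None and e[x] != score:
            continue
        weights[x] = px * (e[x] if z == 1 else 1 - e[x])
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# Without adjustment the treated and control groups differ on x.
unbalanced = dist_x_given(1) != dist_x_given(0)

# Within each level of e(x) the two groups have the same distribution of x.
balanced = all(dist_x_given(1, s) == dist_x_given(0, s) for s in set(e.values()))

print(unbalanced, balanced)  # True True
```

At the shared score value 1/2, both groups put probability 3/4 on x = (0, 1) and 1/4 on x = (1, 0), which are the population proportions within that level of e(x).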

Theorem 3, below, is the key result for showing that if treatment
assignment is strongly ignorable then adjustment for a balancing score b(x)
is sufficient to produce unbiased estimates of the average treatment effect
(1.1).

Theorem  3 

If treatment assignment is strongly ignorable given x, then it is
strongly ignorable given the balancing score b(x); that is,

$$(r_1, r_0) \perp\!\!\!\perp z \mid x$$

implies

$$(r_1, r_0) \perp\!\!\!\perp z \mid b(x).$$

Proof: By assumption

$$p(r_1, r_0, z, x) = p(r_1, r_0 \mid x)\, p(z \mid x)\, p(x)$$

which equals

$$p(r_1, r_0 \mid x)\, p(z \mid b(x))\, p(x),$$

since b(x) is a balancing score. Then

$$\begin{aligned}
p(r_1, r_0, z \mid b(x) = c)
  &= \int_{x:\, b(x)=c} p(r_1, r_0 \mid x)\, p(z \mid b(x) = c)\, p(x \mid b(x) = c)\, dx \\
  &= p(z \mid b(x) = c) \int_{x:\, b(x)=c} p(r_1, r_0 \mid x)\, p(x \mid b(x) = c)\, dx \\
  &= p(z \mid b(x) = c)\, p(r_1, r_0 \mid b(x) = c)
\end{aligned}$$

as required. //

We  are  now  ready  to  relate  balancing  scores  and  ignorable  treatment 
assignment  to  the  estimation  of  treatment  effects. 

The response $r_t$ to treatment t is observed only if the unit receives
treatment t (i.e., z = t). Thus, if a randomly selected treated unit
(z = 1) is compared to a randomly selected control unit (z = 0), the
expected difference in response is

$$E(r_1 \mid z = 1) - E(r_0 \mid z = 0). \qquad (2.2)$$

Expression (2.2) does not equal (1.1) in general because the available samples
are not from the marginal distribution of $r_t$, but rather from the
conditional distribution of $r_t$ given z = t. In other words, in general,
randomly selected units cannot act as controls for one another; i.e., the
expected difference in their responses does not generally equal the average
treatment effect.

Suppose  a  specific  value  of  the  vector  of  covariates  x  is  randomly 
sampled  from  the  entire  population  of  units — both  treated  and  control  units 
together — and  then  a  treated  unit  and  a  control  unit  are  found  both  having 
this  value  for  the  vector  of  covariates.  In  this  two  step  sampling  process, 
the  expected  difference  in  response  is 

$$E_x\{E(r_1 \mid x, z = 1) - E(r_0 \mid x, z = 0)\}, \qquad (2.3)$$

where $E_x$ denotes expectation with respect to the distribution of x in the
entire population of units. If treatment assignment is strongly ignorable,
that is if (1.3) holds, then (2.3) equals

$$E_x\{E(r_1 \mid x) - E(r_0 \mid x)\},$$

which  does  equal  the  average  treatment  effect  (1.1).  In  other  words,  with 
strongly  ignorable  treatment  assignment,  two  units  with  the  same  x  but 
different  treatments  can  act  as  controls  for  one  another;  i.e.,  the  expected 
difference  in  their  responses  equals  the  average  treatment  effect.  This 
formal  observation  is  due  to  Rubin  (1977),  although  it  is  implicit  in  earlier 
discussions  of  experimental  design  (e.g.,  Cox,  1958,  Chapter  2). 


Now suppose a value of a balancing score b(x) is sampled from the
entire population of units and then a treated unit and a control unit are
sampled from all patients having this value of b(x), but perhaps different
values of x. Given strongly ignorable treatment assignment, it follows from
Theorem 3 that

$$E(r_1 \mid b(x), z = 1) - E(r_0 \mid b(x), z = 0) = E(r_1 \mid b(x)) - E(r_0 \mid b(x))$$

from which it follows that

$$\begin{aligned}
E_{b(x)}\{E(r_1 \mid b(x), z = 1) - E(r_0 \mid b(x), z = 0)\}
  &= E_{b(x)}\{E(r_1 \mid b(x)) - E(r_0 \mid b(x))\} \qquad (2.4) \\
  &= E(r_1 - r_0)
\end{aligned}$$

where $E_{b(x)}$ denotes expectation with respect to the distribution of b(x)
in the entire population. In words, under strongly ignorable treatment
assignment,  units  with  the  same  value  of  the  balancing  score  b(x)  but 
different  treatments  can  act  as  controls  for  each  other,  in  the  sense  that  the 
expected  difference  in  their  responses  equals  the  average  treatment  effect. 
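
Expression (2.4) can be verified exactly on a small constructed population. The sketch below is illustrative only: the covariate distribution, propensity score, and response functions are hypothetical, and the responses are taken to be deterministic functions of x so that strong ignorability given x holds trivially. The balancing score used is e(x) itself, which takes only two values although x takes four.

```python
from fractions import Fraction as F

p_x = {x: F(1, 4) for x in range(4)}                   # hypothetical covariate distribution
e = {0: F(1, 4), 1: F(1, 4), 2: F(3, 4), 3: F(3, 4)}   # hypothetical propensity score e(x)
r1 = {x: F(2 * x + 1) for x in p_x}                    # response under treatment 1
r0 = {x: F(x) for x in p_x}                            # response under treatment 0

def group_mean(r, z, score=None):
    """E(r | z), or E(r | e(x) = score, z) when a score is given."""
    w = {x: p_x[x] * (e[x] if z == 1 else 1 - e[x])
         for x in p_x if score is None or e[x] == score}
    return sum(w[x] * r[x] for x in w) / sum(w.values())

ate = sum(p_x[x] * (r1[x] - r0[x]) for x in p_x)       # expression (1.1)
naive = group_mean(r1, 1) - group_mean(r0, 0)          # expression (2.2)

# Expression (2.4): average the within-score differences over the distribution of e(x).
adjusted = F(0)
for s in set(e.values()):
    p_s = sum(p_x[x] for x in p_x if e[x] == s)
    adjusted += p_s * (group_mean(r1, 1, s) - group_mean(r0, 0, s))

print(ate, naive, adjusted)  # 5/2 4 5/2
```

The unadjusted comparison (2.2) gives 4, while averaging the within-score differences recovers the average treatment effect 5/2, even though units compared within a score level may differ in x.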

The  above  argument  has  established  the  following  theorem. 

Theorem  4. 

Suppose  treatment  assignment  is  strongly  ignorable.  Suppose  further  that 
a  group  of  patients  is  sampled  using  x  such  that  (1)  b(x)  is  constant  for 
all  patients  in  the  group,  and  (2)  at  least  one  patient  received  each 
treatment.  Then,  for  these  patients,  the  expected  difference  in  treatment 
means  equals  the  average  treatment  effect  at  that  value  of  b(x);  that  is, 

$$E(r_1 \mid b(x), z = 1) - E(r_0 \mid b(x), z = 0) = E(r_1 - r_0 \mid b(x)).$$

3. PRACTICAL ISSUES IN THE USE OF BALANCING SCORES


In  practice,  homogeneous  subclasses  and  exact  matches  on  balancing  scores 
are  difficult  to  obtain,  and  moreover  the  balancing  scores  must  be  estimated 
from  the  data.  This  section  considers  three  practical  issues:  the  effects  of 
imperfect  control  for  balancing  scores;  models  for  the  propensity  score 
e(x);  the  effects  in  small  samples  of  using  an  estimated  balancing  score. 

3.1 Consequences of Imperfect Control for Balancing Scores

Since it is generally difficult in practice to find treated and control
units for comparison with exactly the same value of the propensity score,
units with similar but not identical values of the propensity score may be
required. Theorem 5 below shows that if $\tilde{e}(x)$ is a close approximation to
e(x) then subclassification on $\tilde{e}(x)$ almost balances x. For example,
e(x) might be a continuous function of continuous x, whereas $\tilde{e}(x)$ might
be a discrete approximation to e(x) used to define a few subclasses within
which $\tilde{e}(x)$ is constant.

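Theorem 5

Suppose $|e(x) - \tilde{e}(x)| < \epsilon$ for all x. Then

$$|p(z, x \mid \tilde{e}(x)) - p(z \mid \tilde{e}(x))\, p(x \mid \tilde{e}(x))| < 2\epsilon \quad \text{for all } x.$$

Proof: Since

$$\begin{aligned}
|p(A \mid B, C) - p(A \mid C)|
  &\geq |p(A \mid B, C) - p(A \mid C)|\, p(B \mid C) \\
  &= |p(A, B \mid C) - p(A \mid C)\, p(B \mid C)|
\end{aligned}$$

it is sufficient to show that

$$|p(z=1 \mid x) - p(z=1 \mid \tilde{e}(x))| < 2\epsilon.$$

Now, pick an x, and let c be the value of $\tilde{e}(x)$. Then

$$\begin{aligned}
|p(z=1 \mid x) - p(z=1 \mid \tilde{e}(x) = c)|
  &= \Big| e(x) - \int_{v:\, \tilde{e}(v)=c} p(z=1 \mid v)\, p(v \mid \tilde{e}(v) = c)\, dv \Big| \\
  &= \Big| e(x) - \tilde{e}(x) + \int_{v:\, \tilde{e}(v)=c} \{\tilde{e}(v) - e(v)\}\, p(v \mid \tilde{e}(v) = c)\, dv \Big| \\
  &< 2\epsilon
\end{aligned}$$

as required. //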
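
A numerical check of the bound in Theorem 5 may help fix ideas. The sketch below is a hypothetical setup, not taken from the paper: x is uniform on 100 points, e(x) is linear in x, and the approximation (called `e_tilde` here) replaces e(x) by the midpoint of one of five subclasses of width 0.2, so that the two never differ by more than 0.1.

```python
import math

xs = range(100)                                  # hypothetical discrete covariate, uniform on 0..99
e = {x: 0.05 + 0.9 * x / 99 for x in xs}         # hypothetical true propensity score
# Discrete approximation: midpoint of the width-0.2 subclass containing e(x).
e_tilde = {x: (math.floor(e[x] / 0.2) + 0.5) * 0.2 for x in xs}
eps = 0.11                                       # |e(x) - e_tilde(x)| <= 0.1 < eps

def max_imbalance():
    """Largest |p(z, x | e_tilde) - p(z | e_tilde) p(x | e_tilde)| over z and x."""
    worst = 0.0
    for c in set(e_tilde.values()):
        members = [x for x in xs if e_tilde[x] == c]
        p_x_c = 1.0 / len(members)                        # p(x | e_tilde = c), since x is uniform
        p_z1_c = sum(e[x] for x in members) * p_x_c       # p(z = 1 | e_tilde = c)
        for x in members:
            gap_z1 = abs(e[x] * p_x_c - p_z1_c * p_x_c)
            gap_z0 = abs((1 - e[x]) * p_x_c - (1 - p_z1_c) * p_x_c)
            worst = max(worst, gap_z1, gap_z0)
    return worst

print(max_imbalance() < 2 * eps)  # True
```

In this example the observed imbalance is far below the bound, which is consistent with the theorem giving a worst-case guarantee rather than a sharp estimate.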

3.2  The  Form  of  the  Propensity  Score  Under  Various  Models 
Often  the  propensity  scores  e(x)  must  be  estimated  from  available 
data.  Therefore,  it  is  convenient  to  note  that  the  propensity  score  has 
familiar  forms  under  certain  familiar  models;  in  particular,  the  propensity 
score  can  often  be  modelled  using  an  appropriate  logit  model  (Cox  (1970))  or 
discriminant  score. 

Clearly,

$$e(x) = p(z = 1 \mid x) = \frac{p(z = 1)\, p(x \mid z = 1)}{p(z = 1)\, p(x \mid z = 1) + p(z = 0)\, p(x \mid z = 0)}.$$

Elementary manipulations establish the following facts.

1. If $p(x \mid z = t) = N(\mu_t, \Sigma)$ then e(x) is a monotone function of
the linear discriminant $x^T \Sigma^{-1} (\mu_1 - \mu_0)$. Therefore, stratification or
matching on e(x) includes discriminant matching as a special case. This
method was first proposed by Cochran and Rubin (1973), and was studied further
in Rubin (1976, 1979, 1980).

2. If $p(x \mid z = t)$ is a polynomial exponential family distribution,
i.e., if

$$p(x \mid z = t) = h(x) \exp(P_t(x))$$

where $P_t(x)$ is a polynomial in x of degree k, say, then e(x) obeys a
polynomial logit model

$$\log_e \frac{e(x)}{1 - e(x)}
  = \log_e \frac{p(z = 1)}{1 - p(z = 1)} + P_1(x) - P_0(x)
  = \log_e \frac{p(z = 1)}{1 - p(z = 1)} + Q(x)$$

where Q(x) is a degree k polynomial in x.

This  polynomial  exponential  family  includes  the  linear  exponential  family 
(resulting  in  a  linear  logit  model  for  e(x)),  the  quadratic  exponential 
family  described  in  Dempster  (1971),  and  the  binary  data  model  described  in 
Cox  (1975). 
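
The first fact above can be checked numerically. The sketch below assumes bivariate normal covariates with identity covariance matrix, hypothetical group means, and a hypothetical prior p(z = 1); it compares e(x) computed from Bayes' rule with a linear logit in the discriminant x'(mu1 - mu0).

```python
import math

mu1, mu0 = (1.0, 0.5), (0.0, 0.0)   # hypothetical group means; common covariance is the identity
prior = 0.3                          # hypothetical p(z = 1)

def density(x, mu):
    """Bivariate normal density with identity covariance matrix."""
    q = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-0.5 * q) / (2.0 * math.pi)

def propensity_bayes(x):
    """e(x) from Bayes' rule applied to the two group densities."""
    num = prior * density(x, mu1)
    return num / (num + (1.0 - prior) * density(x, mu0))

def propensity_logit(x):
    """e(x) from a linear logit in the discriminant x'(mu1 - mu0)."""
    discriminant = sum(xi * (a - b) for xi, a, b in zip(x, mu1, mu0))
    intercept = (math.log(prior / (1.0 - prior))
                 - 0.5 * (sum(a * a for a in mu1) - sum(b * b for b in mu0)))
    return 1.0 / (1.0 + math.exp(-(intercept + discriminant)))

for point in [(0.0, 0.0), (1.0, -1.0), (2.0, 0.5)]:
    print(point, propensity_bayes(point), propensity_logit(point))
```

The two columns agree to rounding error, and because the logistic function is increasing, e(x) is a monotone function of the discriminant, so ordering units by either quantity gives the same ordering.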

3.3 Small Sample Theory for Discrete x

This section demonstrates that subclassification on a sample estimate
$\hat{e}(x)$ of the propensity score e(x) produces sample balance, that is balance
in terms of the sample or empirical distributions. Although the theorems of
this section are formally correct without any assumption about x, they are
of practical value only when x is discrete.

The observed treatment assignments and covariates are $(z_i, x_i)$,
i = 1, ..., n. For conditions A, B, C, ..., let #(A, B, C, ...) be the
number of vectors $(z_i, x_i)$ which satisfy all of A and B and C and
... . For example, #(z=1, x=(1,0)) is the number of vectors $(z_i, x_i)$
such that $z_i = 1$, $x_i = (1,0)$. Define the sample conditional proportion
prop(A|B) by

$$\mathrm{prop}(A \mid B) = \frac{\#(A, B)}{\#(B)} \quad \text{if } \#(B) \neq 0$$

and leave prop(A|B) undefined if #(B) = 0.

Estimate e(x) by $\hat{e}(a) = \mathrm{prop}(z=1 \mid x=a)$. If $\hat{e}(a) = 0$ or 1 then
all units with x = a received the same treatment. Theorem 1', which
parallels Theorem 1, shows that at all intermediate values of $\hat{e}(a)$, that is
for $\hat{e}(a) \in (0,1)$, there is balance.

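Theorem 1'. Suppose $\hat{e}(a) \in (0,1)$. Then

$$\mathrm{prop}(z=b, x=a \mid \hat{e}(x) = \hat{e}(a))
  = \mathrm{prop}(z=b \mid \hat{e}(x) = \hat{e}(a))\, \mathrm{prop}(x=a \mid \hat{e}(x) = \hat{e}(a)).$$

Proof: Since if $\#(B, C) \neq 0$, then

$$\begin{aligned}
|\mathrm{prop}(A \mid B, C) - \mathrm{prop}(A \mid B)| \cdot \mathrm{prop}(C \mid B)
  &= \left| \frac{\#(A,B,C)}{\#(B,C)} - \frac{\#(A,B)}{\#(B)} \right| \cdot \frac{\#(B,C)}{\#(B)} \\
  &= \left| \frac{\#(A,B,C)}{\#(B)} - \frac{\#(A,B)\, \#(B,C)}{\#(B)\, \#(B)} \right| \\
  &= |\mathrm{prop}(A, C \mid B) - \mathrm{prop}(A \mid B)\, \mathrm{prop}(C \mid B)|,
\end{aligned}$$

it is sufficient to show

$$\mathrm{prop}(z=1 \mid x=a) = \mathrm{prop}(z=1 \mid \hat{e}(x) = \hat{e}(a)).$$

Now,

$$\mathrm{prop}(z=1 \mid \hat{e}(x) = c)
  = \sum_{d:\, \hat{e}(d)=c} \mathrm{prop}(z=1 \mid x=d)\, \mathrm{prop}(x=d \mid \hat{e}(x) = c).$$

Every term in the sum has $\mathrm{prop}(z=1 \mid x=d) = \hat{e}(d) = c$, so the sum equals

$$c \sum_{d:\, \hat{e}(d)=c} \mathrm{prop}(x=d \mid \hat{e}(x) = c) = c,$$

and with $c = \hat{e}(a)$ this is $\mathrm{prop}(z=1 \mid x=a)$, as desired. //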
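
The sample balance asserted by Theorem 1' can be checked directly by counting. The sketch below uses a hypothetical sample of twelve (z, x) pairs in which two covariate values, "A" and "B", share the same estimated score; the helper names `count`, `e_hat`, and `is_balanced` are introduced here for illustration.

```python
from fractions import Fraction as F

# Hypothetical sample of (z, x) pairs with a discrete covariate x.
sample = ([(1, "A")] * 1 + [(0, "A")] * 1 +
          [(1, "B")] * 2 + [(0, "B")] * 2 +
          [(1, "C")] * 3 + [(0, "C")] * 1)

def count(pred):
    """#(...): number of sample vectors (z, x) satisfying the condition."""
    return sum(1 for z, x in sample if pred(z, x))

def e_hat(a):
    """Sample propensity score prop(z = 1 | x = a)."""
    return F(count(lambda z, x: z == 1 and x == a), count(lambda z, x: x == a))

def is_balanced(a, b):
    """Check the identity of Theorem 1' at covariate value a and treatment b."""
    c = e_hat(a)
    n_c = count(lambda z, x: e_hat(x) == c)                       # #(e_hat(x) = c)
    joint = F(count(lambda z, x: z == b and x == a), n_c)
    prop_z = F(count(lambda z, x: z == b and e_hat(x) == c), n_c)
    prop_x = F(count(lambda z, x: x == a and e_hat(x) == c), n_c)
    return joint == prop_z * prop_x

print(all(is_balanced(a, b) for a in "ABC" for b in (0, 1)))  # True
```

Here "A" and "B" both have estimated score 1/2, and within that subclass the treated and control units contain "A" and "B" in the same 1:2 ratio, which is the sample balance the theorem describes.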

Similar theorems and proofs about sample balance parallel Theorems 2 and
5. In particular, in parallel with Theorem 5, if $\hat{e}^*(x)$ is close to $\hat{e}(x)$
for all x, then there is nearly sample balance at each value of $\hat{e}^*(x)$. For
example, $\hat{e}^*(x)$ might result from a logit model which closely fits the sample
data.

4. THREE APPLICATIONS OF PROPENSITY SCORES TO
OBSERVATIONAL GROUPS

The  general  results  that  we  have  presented  suggest  that,  in  practice, 
adjustment  for  the  propensity  score  should  be  an  important  component  of 
analysis  of  observational  studies,  because  evidence  of  residual  bias  in  the 
propensity  score  is  evidence  of  potential  bias  in  estimated  treatment 
effects.  We  conclude  with  three  examples  of  how  propensity  scores  can  be 
explicitly  used  to  adjust  for  confounding  variables  in  observational 
studies.  The  examples  involve  the  three  standard  techniques  for  adjustment  in 
observational  studies  noted  by  Cochran  (e.g.,  1965)  and  summarized  by  Rubin 
(1981),  namely,  matched  sampling,  subclassification,  and  covariance 
adjustment. 


4.1 The Use of Propensity Scores to Construct Matched Samples
from Treatment Groups

Matching  is  a  method  of  sampling  from  a  large  reservoir  of  potential 
controls  to  produce  a  control  group  of  modest  size  in  which  the  distribution 
of  covariates  is  similar  to  the  distribution  in  the  treated  group.  Some 
sampling  of  the  control  reservoir  is  often  required  to  control  costs 
associated  with  measuring  the  response,  for  example,  costs  associated  with 
extensive  follow-up  of  patients  in  clinical  studies. 

Although  there  exist  model-based  alternatives  to  matched  sampling  (e.g. 
covariance  adjustment  on  random  samples),  there  are  several  reasons  why 
matching  is  appealing. 

1.  Matched  treatment  and  control  pairs  allow  relatively  unsophisticated 
researchers  to  immediately  appreciate  the  equivalence  of  treatment  and  control 
groups, as well as to perform simple matched pair analyses which adjust for
confounding variables. This issue is discussed in greater detail below in
subsection  4.2  on  balanced  subclassification. 

2.  Even  if  the  model  underlying  a  statistical  adjustment  is  correct,  the 
variance  of  the  estimate  of  the  average  treatment  effect  (1.1)  will  be  less  in 
matched  samples  than  in  random  samples  since  the  distribution  of  x  in 
treated and control groups is more similar in matched than in random
samples.  To  verify  this,  inspect  the  formula  for  the  variance  of  the 
covariance  adjusted  estimate  (e.g.  Snedecor  and  Cochran,  1978,  p.  368),  and 
note  that  the  variance  decreases  as  the  difference  between  treatment  and 
control  means  on  x  decreases. 

3. Model-based adjustment on matched samples is usually more robust to
departures  from  the  assumed  form  of  the  underlying  model  than  model-based 
adjustment  on  random  samples  (cf.  Rubin,  1973b,  1979),  primarily  because  of 
the  more  limited  reliance  on  the  model  and  its  extrapolation. 

4. In studies with limited resources but large control reservoirs and many
confounding  variables,  the  confounding  variables  can  often  be  controlled  by 
multivariate  matching,  but  the  small  sample  sizes  in  the  final  groups  do  not 
allow  control  of  all  variables  by  model-based  methods. 

A multivariate matching method is said (Rubin, 1976a,b) to be equal
percent bias reducing (EPBR) if the bias in each coordinate of x is reduced
by the same percentage. Matching methods which are not EPBR have the
potentially undesirable property that they increase the bias for some linear
functions of x. If matched sampling is performed before the response
$(r_1, r_0)$ can be measured, and if all that is suspected about the relation
between $(r_1, r_0)$ and x is that it is approximately linear, then EPBR matching
methods are reasonable in that they lead to differences in mean response in
matched samples that should be less biased than in random samples.

In section 3.2 we observed that discriminant matching is equivalent to
matching on the propensity score if the covariates x have a multivariate
normal distribution. Assuming multivariate normality, Rubin (1976a) showed
that matching on the population or sample discriminant is EPBR. We now show
that matching on the population propensity score is EPBR under weaker
distributional assumptions. It is assumed that the matching algorithm matches
each treated (z = 1) unit with a control (z = 0) unit drawn from a reservoir
of control units on the basis of the propensity score.
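
One simple matching algorithm of the kind assumed here pairs each treated unit with the closest control not yet used. The sketch below is an illustration under stated assumptions: the function name `nearest_available_match`, the processing of treated units in the order given, the tie-breaking by index, and the nine propensity scores are all hypothetical.

```python
def nearest_available_match(treated_scores, control_scores):
    """Pair each treated unit, in the order given, with the closest unused control.

    Returns a list of (treated_index, control_index) pairs.
    """
    available = set(range(len(control_scores)))
    pairs = []
    for i, score in enumerate(treated_scores):
        if not available:
            break  # reservoir exhausted
        # Closest remaining control; ties are broken by the smaller index.
        j = min(available, key=lambda k: (abs(control_scores[k] - score), k))
        available.remove(j)
        pairs.append((i, j))
    return pairs

def mean(values):
    return sum(values) / len(values)

# Hypothetical propensity scores: 3 treated units and a reservoir of 6 controls.
treated = [0.80, 0.60, 0.55]
controls = [0.10, 0.20, 0.50, 0.58, 0.75, 0.30]

pairs = nearest_available_match(treated, controls)
matched = [controls[j] for _, j in pairs]

initial_bias = mean(treated) - mean(controls)   # bias in e before matching
matched_bias = mean(treated) - mean(matched)    # bias in e after matching
percent_reduction = 100.0 * (initial_bias - matched_bias) / initial_bias
print(pairs, round(percent_reduction, 1))
```

Because each control can be used only once, the result depends on the order in which treated units are processed, so the ordering is a design choice that should be fixed before the responses are examined.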

For  convenience  write  e  for  the  propensity  score  e(x).  The  initial 
bias  in  x  is 

$$b = E(x \mid z = 1) - E(x \mid z = 0).$$

Suppose we have a random sample of treated (z = 1) units and a large reservoir
of randomly sampled control units, and suppose each treated unit is matched
with a control unit from the reservoir. Then the expected bias in matched
samples is

$$b_m = E(x \mid z = 1) - E_m(x \mid z = 0),$$

where the subscript m indicates the distribution in matched samples. Thus
the reduction in bias of x due to matching is

$$b - b_m = E_m(x \mid z = 0) - E(x \mid z = 0). \qquad (4.1)$$

Theorem 6.

For any matching method that uses e alone to match each treated unit
(z = 1) with a control unit (z = 0), the reduction in bias is

$$b - b_m = \int E(x \mid e)\, \{p_m(e \mid z = 0) - p(e \mid z = 0)\}\, de. \qquad (4.2)$$

Proof: From (4.1) we have

$$b - b_m = \int \{E_m(x \mid z = 0, e)\, p_m(e \mid z = 0) - E(x \mid z = 0, e)\, p(e \mid z = 0)\}\, de. \qquad (4.3)$$

For any matching method satisfying the condition of the theorem,

$$E_m(x \mid z = 0, e) = E(x \mid z = 0, e) \qquad (4.4)$$

because any matching method using e alone to match units alters the marginal
distribution of e in the control group (z = 0), but does not alter the
conditional distribution of x given e in the control group.

However, by Theorem 1,

$$E(x \mid z = 0, e) = E(x \mid e). \qquad (4.5)$$

Substitution of (4.4) and (4.5) into equation (4.3) yields the result (4.2). //

Corollary: If $E(x \mid e) = \alpha + \beta e$ for some vectors $\alpha$ and $\beta$, then matching
on the propensity score e alone is EPBR.

Proof: The percent reduction in bias for the ith coordinate of x is, from
equation (4.2),

$$\frac{\beta_i \{E_m(e \mid z = 0) - E(e \mid z = 0)\}}{\beta_i \{E(e \mid z = 1) - E(e \mid z = 0)\}}$$

which is independent of i, as required. //
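
The two steps behind the ratio in the corollary's proof can be written out as follows. This is a sketch of the intermediate algebra, using only (4.2), Theorem 1, and the linearity assumption $E(x \mid e) = \alpha + \beta e$; the subscript i denotes the ith coordinate.

```latex
\begin{align*}
(b - b_m)_i
  &= \int (\alpha_i + \beta_i e)\,\{p_m(e \mid z = 0) - p(e \mid z = 0)\}\,de
     && \text{by (4.2)} \\
  &= \beta_i \{E_m(e \mid z = 0) - E(e \mid z = 0)\}
     && \text{both densities integrate to 1} \\[4pt]
b_i
  &= E\{E(x_i \mid e) \mid z = 1\} - E\{E(x_i \mid e) \mid z = 0\}
     && \text{by Theorem 1} \\
  &= \beta_i \{E(e \mid z = 1) - E(e \mid z = 0)\}
     && \text{the } \alpha_i \text{ terms cancel}
\end{align*}
```

Dividing the first result by the second removes $\beta_i$, so the proportional reduction is the same for every coordinate with $\beta_i \neq 0$.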

Rubin's  (1979)  simulation  study  examines  the  small  sample  properties  of 
discriminant  matching  in  the  case  of  normal  covariates  with  possibly  different 
covariances  in  the  treatment  groups,  so  the  study  includes  situations  where 
the  true  propensity  score  is  a  quadratic  function  of  x,  but  the  discriminant 
score  is  a  linear  function  of  x.  Table  1  presents  previously  unpublished 
results  from  Rubin's  (1979)  study  for  situations  in  which  the  propensity  score 
is  a  monotone  function  of  the  linear  discriminant,  so  propensity  matching  and 
discriminant  matching  are  effectively  the  same.  The  covariates  x  are 
bivariate  normal  with  common  covariance  matrix  I  and  bias  B  along  the 
standardized  population  discriminant.  In  the  simulation,  fifty  treated  units 
are  matched  using  nearest  available  matching  (Cochran  and  Rubin  (1973))  on  the 
sample discriminant with 50 control units drawn from a reservoir of 50R
potential  control  units,  for  R  =  2,3,4;  details  are  found  in  Rubin  (1979). 

Assuming parallel linear response surfaces, Table 1 shows that even in
the absence of additional adjustments, propensity (discriminant) matching
alone can remove most of the bias if the reservoir is relatively large.
Moreover,  table  1  shows  that  the  population  and  sample  propensity  scores  are 
about  equally  effective  in  removing  bias,  so  no  substantial  loss  is  incurred 
by  having  to  estimate  the  propensity  score.  It  should  be  noted  that  the 
conditions  underlying  table  1  differ  from  the  conditions  underlying  theorem  1 
inasmuch as nearest available matching provides only a partial adjustment
for  the  propensity  score  since  exact  matches  are  not  generally  obtained. 

Propensity  matching  should  prove  especially  effective  relative  to 
Mahalanobis  metric  matching  (Cochran  and  Rubin  (1973),  Rubin  (1976a, b,  1979, 
1980)) in situations where markedly nonspherically distributed x make the
use of a quadratic metric unnatural as a measure of distance between treated
and control units. For example, we have found in practice that if x
contains  one  coordinate  representing  a  rare  binary  event,  then  Mahalanobis 
metric  matching  may  try  too  hard  to  exactly  match  that  coordinate,  thereby 
reducing  the  quality  of  matches  on  the  other  coordinates  of  x.  Propensity 
matching  can  effectively  balance  rare  binary  variables  for  which  it  is  not 
possible  to  adequately  match  treated  and  control  units  on  an  individual  basis. 

Table 1

Percent Reduction in Bias Due to Matched Sampling
Based on the Sample and Population Propensity Scores*

Ratio of Size of
Control Reservoir     Propensity
to Size of            Score Used            Initial Bias (B)
Treatment Group       for Matching      .25    .50    .75   1.00
-----------------------------------------------------------------
       2              Sample             92     85     77     67
                      Population         92     87     78     69

       3              Sample            101     96     91     83
                      Population         96     95     91     84

       4              Sample             97     98     95     90
                      Population         98     97     94     89

*Assuming bivariate normal covariates with common covariance matrix,
parallel linear response surfaces, sample size of 50 in treated and control
groups. Estimated percent reduction in bias from Rubin's (1979) study. The
largest estimated standard error for this table is less than .03.

4.2 Subclassification on Propensity Scores


A  second  major  method  of  adjustment  for  confounding  variables  is 
subclassification,  in  which  experimental  and  control  units  are  divided  on  the 
basis  of  x  into  subclasses  or  strata  (Cochran  (1965,  1968),  Cochran  and 
Rubin  (1973)).  Direct  adjustment  with  subclass  total  weights  can  be  applied 
to  the  subclass  differences  in  response  to  estimate  the  average  treatment 
effect  (1.1)  whenever  treatment  assignment  is  strongly  ignorable  (theorem  4) 
without  modelling  assumptions  such  as  parallel  linear  response  surfaces. 
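
As an illustration of direct adjustment with subclass total weights, the sketch below averages the within-subclass treated-minus-control differences in mean response, weighting each subclass by its share of all units. The function name and the six-unit toy data are hypothetical and serve only to show the arithmetic.

```python
import numpy as np


def direct_adjustment(r, z, subclass):
    """Sum over subclasses of (n_s / N) * (treated mean - control mean)."""
    r, z, subclass = map(np.asarray, (r, z, subclass))
    estimate = 0.0
    for s in np.unique(subclass):
        in_s = subclass == s
        treated = in_s & (z == 1)
        control = in_s & (z == 0)
        if not treated.any() or not control.any():
            raise ValueError(f"subclass {s!r} lacks treated or control units")
        diff = r[treated].mean() - r[control].mean()
        estimate += in_s.sum() / len(r) * diff   # subclass total weight n_s / N
    return estimate


# Toy data (hypothetical): two subclasses of three units each.
r = np.array([3.0, 1.0, 2.0, 6.0, 5.0, 4.0])
z = np.array([1, 0, 0, 1, 1, 0])
subclass = np.array([0, 0, 0, 1, 1, 1])
est = direct_adjustment(r, z, subclass)
print(est)
```

In the toy data each subclass has a treated-minus-control difference of 1.5 and weight 3/6, so the adjusted estimate is 1.5.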

As  a  method  of  multivariate  adjustment,  subclassification  has  the 
advantage  that  it  involves  direct  comparisons  of  ostensibly  comparable  groups 
of patients within each subclass and therefore can be both understandable
and persuasive to an audience with limited statistical training. The
comparability  of  patients  within  strata  can  be  verified  by  the  simplest 
methods,  such  as  bar  charts  of  means.  Since  the  results  of  observational 
studies  are  often  disputed,  and  since  such  disputes  are  not  always  confined  to 
statistically  sophisticated  participants  and  audiences,  correct  results  should 
be  presented  in  a  manner  which  is  both  persuasive  and  understandable  to  the 
study's  audience.  Cox  (1981,  p.  291)  emphasizes  the  importance  of  presenting 
results  in  ways  that  are  "vivid,  simple,  and  accurate."  Of  course,  it  should 
be  stressed  that  balance  on  observed  covariates  x  does  not  imply  balance  on 
unobserved  covariates. 

A  major  problem  with  subclassification,  noted  by  Cochran  (1965),  is  that 
as  the  number  of  confounding  variables  increases,  the  number  of  subclasses 
grows  dramatically,  so  that  even  with  only  two  categories  per  variable, 
yielding 2^p subclasses for p variables, most subclasses will not have both
treatment  and  control  units.  Subclassification  on  the  propensity  score  is  a 
natural  way  to  obviate  this  problem. 

We now use an estimate of the propensity score to subclassify patients in
an actual observational study of therapies for coronary artery disease. The
treatments are coronary artery bypass surgery (z = 1) and drug therapy (z = 0).
The covariates x are clinical, hemodynamic, and demographic measurements on
each patient made prior to treatment assignment. Even though the covariates
have  quite  different  distributions  in  the  two  treatment  groups,  within  each  of 
the  five  subclasses,  the  surgical  and  drug  patients  will  be  seen  to  have 
similar  sample  distributions  of  x. 

The  propensity  score  was  estimated  using  a  logit  model  for  z  given  x. 
Covariates  and  interactions  among  covariates  were  selected  for  the  model  using 
a  stepwise  procedure.  Based  on  Cochran's  (1968)  observation  that 
subclassification  with  five  subclasses  is  sufficient  to  remove  at  least  90%  of 
the  bias  for  many  continuous  distributions,  five  subclasses  of  equal  size  were 
constructed at the quintiles of the sample distribution of the propensity score,
each containing 303 patients. Beginning with the subclass with the highest
propensity scores, the five subclasses contained 234, 164, 98, 68, and 26
surgical patients, respectively.
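
The procedure just described (fit a logit model for z given x, then cut the estimated scores at their sample quintiles) can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated rather than the coronary artery data, the Newton-Raphson fitting routine and all names are hypothetical, and no stepwise selection of covariates or interactions is attempted.

```python
import numpy as np


def fit_logit(X, z, n_iter=25):
    """Maximum-likelihood logit fit by Newton-Raphson (intercept first)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        grad = A.T @ (z - p)                       # score vector
        hess = A.T @ (A * (p * (1.0 - p))[:, None])  # information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def propensity_quintiles(X, z):
    """Estimated propensity scores and equal-size subclasses 0..4
    (0 = lowest scores, 4 = highest)."""
    beta = fit_logit(X, z)
    A = np.column_stack([np.ones(len(X)), X])
    e_hat = 1.0 / (1.0 + np.exp(-A @ beta))
    ranks = np.argsort(np.argsort(e_hat))          # 0 .. N-1
    subclass = ranks * 5 // len(e_hat)
    return e_hat, subclass


# Simulated data (hypothetical): 1000 units, 3 covariates.
rng = np.random.default_rng(1)
N = 1000
X = rng.normal(size=(N, 3))
true_e = 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
z = (rng.uniform(size=N) < true_e).astype(float)

e_hat, subclass = propensity_quintiles(X, z)
counts = np.bincount(subclass, minlength=5)
treated_per_subclass = np.bincount(subclass, weights=z, minlength=5)
print(counts, treated_per_subclass)
```

As in the study, the subclasses have equal total size while the number of treated units differs markedly from one subclass to the next.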

For  each  of  the  74  covariates,  table  2  summarizes  the  balance  before  and 
after subclassification. The column labeled "2-Sample" in table 2 contains
F-statistics, that is, the squares of the usual two-sample t-statistics for
comparing the surgical group and drug group means of each variable prior to
subclassification. The last two columns of table 2 contain F-statistics for
the main effect of treatment and for the interaction in a 2 × 5 (treatments
by subclasses) analysis of variance, performed for each covariate. It is

Table 2a

F-TESTS OF BALANCE BEFORE AND AFTER STRATIFICATION

                         2-Way Anova
                     -------------------
                     Main
Variable  2-Sample   Effect  Interaction

    1          4.4      0.0      0.7
    2         18.1      0.0      0.7
    3          6.8      0.0      1.4
    4         25.0      0.2      0.8
    5          5.3      1.0      0.9
    6          7.3      2.2      1.2
    7         26.0      0.2      0.3
    8         10.9      1.6      0.5
    9         11.6      1.2      1.0
   10          6.8      0.1      1.2
   11         38.4      0.4      1.4
   12          9.0      0.1      2.9*
   13          6.8      0.0      0.8
   14          7.3      0.1      0.4
   15          4.4      0.0      0.2
   16         23.0      0.0      0.6
   17         10.2      0.3      1.1
   18         31.4      0.1      2.2
   19          4.8      0.1      0.7
   20          6.2      0.1      0.9
   21         20.2      0.2      1.3
   22          7.8      0.5      0.9
   23         10.2      0.6      0.8
   24          4.8      0.2      0.0
   25          6.8      0.0      1.3
   26         25.0      0.2      0.0
   27         10.9      0.2      0.3
   28         10.9      0.2      0.2
   29          4.0      0.0      1.3
   30          5.8      0.1      0.1
   31          8.4      0.3      0.5
   32         13.0      0.1      0.2
   33         13.0      2.1      0.4
   34         16.0      0.1      1.4
   35         24.0      0.3      0.1
   36         16.0      1.0      0.2
   37          9.6      0.7      0.4
   38         10.9      0.7      0.2
   39          4.0      0.2      0.8
   40         14.4      0.1      0.4
   41          7.8      0.7      0.8
   42         51.8      0.4      0.9
   43         14.4      0.1      0.4
   44          9.6      1.0      1.3
   45         29.2      0.3      0.4
   46          4.3      0.5      0.8
   47         18.5      0.3      2.2
   48          7.8      0.4      0.5
   49         15.2      0.4      0.2

Table 2b

F-TESTS OF BALANCE BEFORE AND AFTER STRATIFICATION

                         2-Way Anova
                     -------------------
                     Main
Variable  2-Sample   Effect  Interaction

   50          5.8      0.0      0.8
   51         19.4      0.0      0.3
   52          5.8      2.3      1.4
   53         13.0      3.6      2.0
   54          6.2      0.6      0.8
   55          8.4      0.9      0.9
   56         16.0      1.1      0.4
   57          5.8      1.6      0.8
   58          6.8      0.2      2.4
   59          4.8      0.5      0.9
   60         14.4      0.2      0.6
   61          7.8      0.0      1.1
   62         22.1      0.3      0.2
   63          6.2      1.0      1.2
   64         11.6      0.2      0.3
   65         18.5      0.3      0.2
   66         43.6      0.1      1.4
   67         31.4      0.0      1.0
   68         18.5      0.0      0.7
   69         13.0      0.8      2.3
   70         10.9      0.0      2.1
   71         10.9      0.0      2.4
   72         11.6      0.4      1.4
   73         16.8      0.0      1.2
   74          7.8      3.1      0.5

easily seen that there is considerable imbalance prior to subclassification,
and  yet  within  subclasses  there  is  greater  balance  than  would  have  been 
expected  if  treatments  had  been  assigned  at  random  within  each  subclass. 
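
For a single covariate, the three F-statistics reported in table 2 can be computed along the following lines. The paper does not say how sums of squares were computed for the unbalanced 2 × 5 layout, so this sketch assumes one common choice: the treatment main effect is tested after adjusting for subclass, the interaction is tested against the additive model, and both use the cell-means residual mean square. All names and the simulated covariate are hypothetical, and every cell is assumed to be nonempty.

```python
import numpy as np


def rss(design, v):
    """Residual sum of squares from least squares of v on design."""
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    resid = v - design @ coef
    return float(resid @ resid)


def balance_f_stats(v, z, subclass, n_sub=5):
    """Return (two-sample F, main-effect F, interaction F) for covariate v."""
    n = len(v)
    ones = np.ones((n, 1))
    zc = z.reshape(-1, 1).astype(float)
    S = (subclass[:, None] == np.arange(n_sub)).astype(float)  # subclass dummies
    cells = np.hstack([S * (1.0 - zc), S * zc])                # 2 x n_sub cell means

    # Before subclassification: equals the squared pooled two-sample t.
    rss_z = rss(np.hstack([ones, zc]), v)
    f_two = (rss(ones, v) - rss_z) / (rss_z / (n - 2))

    # After subclassification: treatments-by-subclasses analysis of variance.
    rss_cells = rss(cells, v)
    mse = rss_cells / (n - 2 * n_sub)
    rss_additive = rss(np.hstack([S, zc]), v)
    f_main = (rss(S, v) - rss_additive) / mse
    f_inter = (rss_additive - rss_cells) / (n_sub - 1) / mse
    return f_two, f_main, f_inter


# Simulated covariate (hypothetical) that drives treatment assignment.
rng = np.random.default_rng(3)
n = 1500
v = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-v))
z = (rng.uniform(size=n) < e).astype(int)
subclass = np.argsort(np.argsort(e)) * 5 // n    # quintiles of the score

f_two, f_main, f_inter = balance_f_stats(v, z, subclass)
print(f_two, f_main, f_inter)
```

Because five subclasses give only a partial adjustment, the main-effect F in such a simulation need not be near zero, but it should be much smaller than the two-sample F.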

Subclassification  on  the  propensity  score  is  not  the  same  as  any  of  the 
several  methods  proposed  by  Miettinen  (1976):  the  propensity  score  is  not 
generally  a  "confounder"  score.  (For  example,  one  of  Miettinen’s  confounder 
scores is $p(z=1 \mid r_z=1, x) \neq p(z=1 \mid x) = e(x)$.)

4.3 Propensity Scores and Covariance Adjustment

The  third  standard  method  of  adjustment  in  observational  studies  is 
covariance  adjustment.  The  point  estimate  of  the  treatment  effect  obtained 
from  analysis  of  covariance  adjustment  for  multivariate  x  is,  in  fact,  equal 
to  the  estimate  obtained  from  univariate  covariance  adjustment  for  the  sample 
linear  discriminant  based  on  x,  whenever  the  same  sample  covariance  matrix 
is  used  for  both  the  covariance  adjustment  and  the  discriminant  analysis. 

This  fact  is  most  easily  demonstrated  by  linearly  transforming  x  to  (d,v) 
where  d  is  the  sample  discriminant,  and  v  is  orthogonal  to  the  sample 
discriminant  and  thus  has  the  same  sample  mean  in  both  groups.  Since 
covariance  adjustment  is  effectively  adjustment  for  the  linear  discriminant, 
plots of the responses $r_{1i}$ and $r_{0i}$ or residuals $r_{ki} - \hat{r}_{ki}$ (where $\hat{r}_{ki}$
is the value of $r_{ki}$ predicted from the regression model used in the
covariance adjustment) vs. the linear discriminant are useful in identifying
nonlinear  or  nonparallel  response  surfaces,  as  well  as  extrapolations,  which 
might  distort  the  estimate  of  the  average  treatment  effect.  Furthermore,  such 
a  plot  is  a  bivariate  display  of  multivariate  adjustment,  and  as  such  might  be 
useful  for  general  presentation. 
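
The equality of the two point estimates can be checked numerically. The sketch below uses simulated data, with hypothetical names, group sizes and coefficients. It forms the sample linear discriminant from the pooled within-group covariance matrix, which is the matrix implicitly used by least-squares covariance adjustment, and compares the coefficient of z in the two regressions.

```python
import numpy as np


def treatment_coef(covariates, z, r):
    """Least squares of r on (1, z, covariates); returns the coefficient of z."""
    design = np.column_stack([np.ones(len(r)), z, covariates])
    coef, *_ = np.linalg.lstsq(design, r, rcond=None)
    return coef[1]


# Simulated data (hypothetical): 80 treated and 120 control units.
rng = np.random.default_rng(2)
n = 200
z = (np.arange(n) < 80).astype(float)
x = rng.normal(size=(n, 4)) + 0.5 * z[:, None]
r = 1.0 * z + x @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=n)

# Pooled within-group sample covariance matrix.
x1, x0 = x[z == 1], x[z == 0]
c1, c0 = x1 - x1.mean(axis=0), x0 - x0.mean(axis=0)
S = (c1.T @ c1 + c0.T @ c0) / (n - 2)

# Sample linear discriminant d = x' S^{-1} (mean_1 - mean_0).
d = x @ np.linalg.solve(S, x1.mean(axis=0) - x0.mean(axis=0))

tau_multivariate = treatment_coef(x, z, r)   # adjustment for all of x
tau_discriminant = treatment_coef(d, z, r)   # adjustment for d alone
print(tau_multivariate, tau_discriminant)
```

The two printed values agree up to rounding error, although the fitted slopes and residuals of the two regressions differ.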

Generally, plots of responses and residuals from covariance analysis
against the propensity score e(x) are more appropriate than plots against the
discriminant, unless of course the covariates are multivariate normal with
common covariance matrix, in which case the propensity score is a monotone
function of the discriminant. The reason is that if treatment assignment is
strongly ignorable, at each e(x) the expected difference in response
$E(r_1 \mid z=1, e(x)) - E(r_0 \mid z=0, e(x))$ equals the average treatment effect
at e(x), namely $E(r_1 \mid e(x)) - E(r_0 \mid e(x))$. This property holds for the
propensity score e(x) and for any balancing score b(x), but does not
generally hold for other functions of x; generally, plots against other
functions of x are still confounded by x. Consequently, a plot of the
responses $r_1$, $r_0$ or residuals against e(x) can reveal particularly
important  nonparallelism,  nonlinearity,  or  extrapolations  in  the  response 
surfaces.  Since  the  purpose  of  such  a  plot  is  to  reveal  departures  from 
assumptions,  some  enhancement  to  accentuate  trends  in  the  plot  will  often  be 
necessary  using,  perhaps,  techniques  such  as  described  by  Cleveland  and 
Kleiner  (1975). 

Cases  where  covariance  adjustment  has  been  seen  to  perform  quite  poorly 
are  precisely  those  cases  in  which  the  linear  discriminant  is  not  a  monotone 
function  of  the  propensity  score,  so  that  covariance  adjustment  is  implicitly 
adjusting  for  a  poor  approximation  to  the  propensity  score.  In  the  case  of 
univariate  x,  the  linear  discriminant  is  a  linear  function  of  x,  whereas 
the propensity score may not be a monotone function of x if the variances
of x in the treated and control groups are unequal. (Intuitively, if the
variance of x in the control group is much larger than the variance in the
treated group, then individuals with the largest and smallest x values
usually come from the control group.) Rubin (1973b, tables 4 and 6, with
r=1, and $\hat{\tau}_p$ as the estimator) has shown that univariate covariance
adjustment will either increase the bias by up to 304% or overcorrect for bias
by  298%  for  certain  exponential  response  surfaces  if  the  variances  of  x  in 
the  treated  and  control  groups  differ  by  a  2:1  ratio.  Unequal  variances  of 
covariates  are  not  uncommon  in  observational  studies,  since  the  subset  of 
units  which  receives  a  new  treatment  is  often  more  homogeneous  than  the 
general  population.  For  example,  in  the  observational  half  of  the  Salk 
Vaccine  trial,  the  parents  of  second  graders  who  volunteered  for  vaccination 
had  higher  and  therefore  less  variable  educational  achievement  (x)  than 
parents  of  control  children,  that  is  parents  of  all  first  and  third  graders 
(Meier  (1972)). 
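
The nonmonotonicity under unequal variances is easy to see numerically. In the sketch below, x is normal within each group; the means, standard deviations and equal prior probabilities are hypothetical values chosen for illustration, not taken from Rubin (1973b). With these values the logit of e(x) is a quadratic in x with a negative leading coefficient, so e(x) rises and then falls, whereas the linear discriminant is monotone in x.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def propensity(x, mean_t=0.5, sd_t=1.0, mean_c=0.0, sd_c=2.0, prior_t=0.5):
    """e(x) = pr(z=1 | x) by Bayes' rule, with x normal within each group."""
    f_t = prior_t * normal_pdf(x, mean_t, sd_t)
    f_c = (1.0 - prior_t) * normal_pdf(x, mean_c, sd_c)
    return f_t / (f_t + f_c)


grid = np.linspace(-6.0, 6.0, 241)       # step 0.05
e = propensity(grid)
peak = grid[np.argmax(e)]                # interior maximum, near x = 2/3
print(peak, e[0], e.max(), e[-1])
```

Units with very small or very large x have propensity scores near zero, which agrees with the intuition that the extreme x values usually come from the more variable control group.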

In  the  case  of  multivariate  normal  x,  Rubin  (1979,  table  2)  has  shown 
that  covariance  adjustment  can  increase  the  expected  squared  bias  by  as  much 
as  55%  if  the  covariance  matrices  in  the  treated  and  control  groups  are 
unequal;  that  is,  if  the  discriminant  is  not  a  monotone  function  of  the 
propensity  score.  In  contrast,  when  the  covariance  matrices  are  equal,  so  the 
discriminant  is  a  monotone  function  of  the  propensity  score,  covariance 
adjustment  removes  between  84%  and  100%  of  the  expected  squared  bias  in  the 
cases  considered  by  Rubin  (1979,  table  2). 